ABSTRACT

Because research synthesis enables one to determine
either the overall effectiveness of a particular treatment or the
relative effectiveness of different types of treatments, it is
becoming increasingly popular as a tool in program evaluation.
Numerous methodological problems arise, however, when research
synthesis is applied to studies conducted in field settings. The
present paper categorizes and discusses these problems as being
threats to either the (1) internal validity (whether one can draw
conclusions about cause and effect), (2) statistical conclusion
validity (whether one's inferential statistics are capable of
detecting cause-and-effect relationships), (3) construct validity
(whether one's treatments and outcome measures are valid
operationalizations of the independent and dependent variables of
interest), or (4) external validity (whether one can generalize
results to particular populations, settings, or time periods) of
research synthesis (see Cook and Campbell, 1979). Specific
recommendations are made for minimizing these threats to validity, in
order to improve the quality of research synthesis in program
evaluation. (Author)

General Overview

This paper addresses strategies for improving the quality
and utility of research synthesis in program evaluation. First,
I will describe the advantages of research synthesis over other
integrative techniques and will argue that these capabilities
make it particularly useful for evaluating questions about
program impact. I will suggest that one way to promote more
excellent program evaluation is to improve the quality of
research synthesis. Just as we can use validity criteria to
improve the quality of primary research, I will argue that we may
likewise improve the quality of research synthesis by controlling
for threats to its validity. Finally, I will consider some
threats to the validity of research synthesis and will suggest
specific means of avoiding these pitfalls.

The Strengths of Research Synthesis

In this paper, I will use the term "research synthesis" to
denote a set of integrative techniques for combining the results
from independent empirical studies on a particular topic or
issue. Other writers have used a variety of terms for research
synthesis, including meta-analysis (Glass, 1976; Hunter, Schmidt,
& Jackson), quantitative review (Cooper & Arkin, 1981),
statistical review (Arkin, Cooper, & Kolditz, 1980), integrative
review (Oliver & Spokane, 1983; Walberg & Haertel, 1980),
empirical cumulation (Taveggia, 1974), data synthesis (Stock,
Okun, Haring, Miller, Kinney, & Ceurvorst, 1982), and evaluation
synthesis (Morra, 1983). Although the terminology varies, all of
these integrative techniques emphasize a similar quantitative
approach to reviewing primary research. This generally involves
extracting from the original research reports the posttest means
and standard deviations of treatment and control groups. These
statistics are then combined to obtain a standardized "effect
size," by subtracting the control group's mean from the treatment
group's mean and dividing the result by some estimate of the
population standard deviation. This effect-size statistic
quantifies the magnitude and direction of treatment effects in a
"common metric" of standard deviations, so that effect sizes can
be pooled and compared across studies. By keeping track of
contextual variables within primary studies, such as
characteristics of the sample, the setting, the treatment, the
outcome measures, and the research design, one can also search
for variables that moderate treatment effects.
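
The effect-size computation described above can be sketched in a
few lines of code. This is only an illustration in Python (not
part of the original paper); the function name and the summary
statistics are hypothetical, and the control group's standard
deviation is assumed as the estimate of the population standard
deviation.

```python
def effect_size(mean_treatment, mean_control, sd):
    """Standardized mean difference: (treatment mean - control mean) / SD."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean_treatment - mean_control) / sd


# Hypothetical posttest summaries from three primary studies that used
# different outcome scales: (treatment mean, control mean, control SD).
studies = [
    (52.0, 50.0, 10.0),
    (105.0, 100.0, 15.0),
    (3.1, 3.4, 0.6),
]

# Each result is expressed in standard deviation units, so the three
# studies can be compared even though their raw scales differ.
effects = [effect_size(mt, mc, sd) for mt, mc, sd in studies]
```

A negative value, as in the third hypothetical study, simply
indicates that the control group outscored the treatment group.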

This quantitative approach has distinct advantages over
traditional methods of literature review. For example, the
traditional qualitative review is largely subjective and provides
little or no statistical information about the strength of
observed effects. Furthermore, other methods of quantitative
review, such as a simple "vote count" that categorizes studies'
outcomes as positive, negative, or zero effects, can produce
misleading "no difference" conclusions, or Type II errors,
because of low statistical power (Hedges & Olkin, 1980; Light &
Smith, 1971; Light & Pillemer, 1984). Research synthesis allows
a more systematic investigation of the mean and variance of
effect sizes. Thus, the main strength of research synthesis is
that it provides a quantitative index of treatment effects
expressed in a metric that is comparable across studies.

Research Synthesis in Impact Evaluation

These capabilities make research synthesis a particularly
useful tool for impact evaluation. Impact evaluation essentially
provides information about a program's effectiveness. This may
involve either (1) questions about a program's overall impact
(e.g., Does the program work? Is it having its desired effect?
Are there any unanticipated side effects?) or (2) questions about
a program's relative impact (e.g., What form of program is most
effective and most cost-efficient? How should the program be
implemented to maximize its effectiveness? For whom and in what
settings does the program work best?).

In addressing questions about a program's overall impact,
research synthesis enables one to "boil down" a set of primary
studies into a single index of treatment effects. This
facilitates more effective cost-benefit analysis by quantifying
benefits for program recipients in a standard unit that can be
meaningfully related to program expenditures. Synthesizing the
literature on school desegregation and black achievement, for
example, Wortman and Bryant (1985) found an overall average
effect size of +.20. This outcome represents a gain for
desegregated students (relative to segregated students) of
roughly two months of growth in academic achievement on a
standardized test. Because this expresses the magnitude of
intellectual growth in a unit that is more meaningful than the
number of points on an achievement test, the policy maker can
better gauge the benefits of desegregation programs relative to
their financial costs.

Although research synthesis is helpful in determining a
program's overall impact, it is perhaps most useful in addressing
questions about relative impact. Because effect sizes express
each study's results in a common metric, one can use research
synthesis to identify variables associated with stronger impact.
Furthermore, the evaluator can use research synthesis to
determine (a) the relative impact of a particular type of program
on different outcome measures or (b) the relative impact of
different types of programs on a particular outcome measure. As
an illustration of how to determine a program's relative impact
on different outcome measures, Messick and Jungeblut
synthesized research on the effects of coaching for the
Scholastic Aptitude Test and found that increases in verbal
scores required more coaching time than equivalent increases in
math scores. As an illustration of how to determine the relative
impact of different programs, Shadish synthesized research
on preventive child health care and found that specific
interventions for specific problems were more effective than
broad-scale interventions. Again, research synthesis improves
cost-benefit analyses in these cases by making comparisons of
relative cost efficiency more meaningful.



Improving the Validity of Research Synthesis

Given that research synthesis is a useful tool for
evaluating program impact, one way of promoting more
excellent impact evaluation is to improve research synthesis.
Accordingly, the remainder of this paper addresses strategies for
improving the validity of research synthesis. Cook and Campbell
(1979) have distinguished among four types of validity in primary
research: internal, statistical conclusion, construct, and
external validity. Just as Cook and Campbell (1979) have urged
researchers to use these validity criteria to improve primary
research, I am proposing that we also use these same criteria to
improve research synthesis.

Internal validity. Internal validity concerns the degree of
confidence that one has in drawing conclusions about cause and
effect (Campbell & Stanley, 1966; Cook & Campbell, 1979). As
with all forms of research, the conclusions drawn from research
synthesis are only as good as the evidence on which they are
based. If all the studies included in the synthesis are
methodologically flawed, then the conclusions drawn from the
synthesis will lack internal validity. For this reason, it is
important to keep track of threats to the internal validity of
each of the primary studies that are included in the research
synthesis. By coding studies for specific threats to their
internal validity, such as selection, maturation, history, and
instrumentation, one may systematically examine how these threats
influence effect size. For example, Wortman and Bryant's (1985)
synthesis of research on school desegregation revealed that
studies which were judged a priori as having problems with
selection bias had significantly greater effect sizes than those
without selection problems. If all the studies included in the
synthesis suffer the same methodological flaw, however, it may be
impossible to determine how this particular threat influences
effect size (Cook & Leviton, 1980; Jackson, 1980). When there is
little or no variance in methodological quality, one has no way
of examining quality as an independent variable. One needs a
sufficient number of high quality studies to use as a baseline
against which to compare studies of poorer quality. Without this
high quality baseline, the internal validity of research
synthesis is suspect. Therefore, when the range of
methodological quality in the primary studies is restricted to
the low end of the continuum, one may increase the internal
validity of research synthesis by using only those studies of
relatively higher quality (Bryant & Wortman, 1984).
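
One way to picture the coding strategy described above is the
following minimal Python sketch, which is an illustration rather
than a procedure taken from the cited work. The study records,
the 0/1 "selection" code, and the function name are all
hypothetical. The sketch compares mean effect sizes across levels
of a coded threat and refuses to do so when every study carries
the same code, since no comparison is then possible.

```python
from statistics import mean

# Hypothetical coded studies: an effect size d plus a 0/1 code indicating
# whether the study was judged a priori to have a selection problem.
coded_studies = [
    {"d": 0.45, "selection": 1},
    {"d": 0.50, "selection": 1},
    {"d": 0.15, "selection": 0},
    {"d": 0.25, "selection": 0},
]


def mean_effect_by_code(studies, threat):
    """Return {code: mean effect size} for one coded validity threat."""
    groups = {}
    for study in studies:
        groups.setdefault(study[threat], []).append(study["d"])
    if len(groups) < 2:
        # With no variance in the code there is no baseline to compare against.
        raise ValueError("all studies share the same code for " + threat)
    return {code: mean(ds) for code, ds in groups.items()}


by_selection = mean_effect_by_code(coded_studies, "selection")
```

In practice one would also test whether the difference between
the groups exceeds what sampling error alone would produce; the
sketch reports only the group means.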

This represents a type of purposive sampling plan (Cook &
Campbell, 1979; Sudman, 1976), whereby one chooses which studies
to include on the basis of their methodological rigor rather than
trying to ensure representativeness. One's choice of sampling
strategy in research synthesis may thus sometimes depend on
whether it is more important to draw unequivocal conclusions that
have limited generalizability or equivocal conclusions that are
widely generalizable. I will return to this notion of purposive
sampling when I discuss external validity in research synthesis.

Statistical conclusion validity. Whereas internal validity
concerns the question of whether some aspect of the treatment
produced observed outcomes, statistical conclusion validity
concerns the question of whether one's inferential statistics are
capable of detecting a cause-and-effect relationship (Cook &
Campbell, 1979). Recent work on the statistical theory
underlying estimates of effect size suggests several strategies
for maximizing the validity of statistical conclusions in
research synthesis.

One way to improve statistical conclusion validity in
research synthesis is to use estimators of effect size that have
less statistical bias. For example, one can obtain a less biased
estimator of effect size by using the pooled within-groups
standard deviation as the unit of standardization in the
denominator, rather than using the control group's standard
deviation, as is typically done (Hedges, 1981, 1982; Hunter et
al.). Furthermore, if the studies included have different
sample sizes, then one can obtain a more precise estimate of
overall treatment effects by weighting each study's effect size
according to the size of its sample (see Hedges and Hunter
et al. for formulas of weighted estimators). Other
investigators have developed procedures to correct estimates of
effect size for unreliability in both the outcome measures of the
primary studies (Hunter et al.) as well as the coding of
variables in research synthesis (Orwin & Cordray, 1985).
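
The three refinements just described can be sketched as follows.
This Python illustration is not drawn from the cited sources: the
function names are hypothetical, the weighting shown is a simple
weighting by total sample size (one of several possible schemes),
and the reliability adjustment is the classical correction for
attenuation, assumed here for illustration.

```python
from math import sqrt


def pooled_sd(n_t, sd_t, n_c, sd_c):
    """Pooled within-groups standard deviation of treatment and control."""
    numerator = (n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2
    return sqrt(numerator / (n_t + n_c - 2))


def weighted_mean_effect(effects, sample_sizes):
    """Mean effect size, weighting each study by its total sample size."""
    total = sum(sample_sizes)
    return sum(d * n for d, n in zip(effects, sample_sizes)) / total


def correct_for_unreliability(d, reliability):
    """Classical disattenuation: divide d by the square root of the
    outcome measure's reliability (assumed to lie in (0, 1])."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return d / sqrt(reliability)


# Hypothetical values, for illustration only.
sd_pooled = pooled_sd(10, 10.0, 10, 10.0)
d_weighted = weighted_mean_effect([0.2, 0.5], [100, 300])
d_corrected = correct_for_unreliability(0.4, 0.64)
```

Because the larger hypothetical study receives three times the
weight of the smaller one, the weighted mean lies closer to its
effect size than a simple average would.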

Special problems with statistical conclusion validity arise
when the research literature being synthesized is quasi-
experimental (Bryant & Wortman, 1984). If treatment and control
groups have not been randomly assigned, then one cannot assume
that these groups are equivalent at the pretest. In these
cases of selection bias, it may be unreasonable to use the
traditional estimate of effect size (Cohen, 1977; Glass, 1976;
Glass, McGaw, & Smith, 1981), which assumes pretest equivalence.
However, if pretest measures are available, then one may
calculate an effect size for the pretest and use it to adjust the
posttest effect size for initial between-groups differences.
Wortman and Bryant (1985) have shown that this pretest-adjusted
effect size is a more accurate estimate of treatment effects in
quasi-experiments than is the traditional posttest effect size.
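
A minimal sketch of such an adjustment follows, in Python. It
assumes that the adjustment is a simple subtraction of the pretest
effect size from the posttest effect size; that assumption, the
function names, and the numbers are illustrative and are not a
restatement of the exact procedure in the cited work.

```python
def standardized_difference(mean_treatment, mean_control, sd):
    """(treatment mean - control mean) / SD for one measurement occasion."""
    return (mean_treatment - mean_control) / sd


def pretest_adjusted_effect(pretest, posttest):
    """Posttest effect size minus pretest effect size.

    Each argument is a (treatment mean, control mean, SD) tuple.
    """
    return standardized_difference(*posttest) - standardized_difference(*pretest)


# Hypothetical quasi-experiment in which the treatment group already
# scored slightly higher than the control group before the program began.
pretest = (51.0, 50.0, 10.0)
posttest = (54.0, 50.0, 10.0)

unadjusted = standardized_difference(*posttest)
adjusted = pretest_adjusted_effect(pretest, posttest)
```

In this hypothetical case the unadjusted posttest effect size
overstates the treatment effect, because part of the posttest
difference was already present at the pretest.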

Another way to improve statistical conclusion validity in
research synthesis is to improve our procedures for identifying
relationships between independent and dependent variables.
Before pooling effect sizes to calculate an overall effect size,
for example, one can statistically test the homogeneity of
studies' outcomes (see Hedges; Hunter et al.). If
one rejects the null hypothesis that sampling error alone
accounts for observed variation in effect sizes, then it is
unlikely that all studies come from the same underlying
population, and one should distrust a single overall effect size.
If one fails to reject this null hypothesis, on the other hand,
then sampling error alone may account for observed variation in
effect sizes, and it may be unreasonable to search for variables
that moderate treatment effects. Furthermore, when doing
multiple statistical comparisons in research synthesis, one
should either correct the alpha level for the per comparison
error rate (Ryan) or use multivariate tests that do so
(Harris, 1975), to avoid the so-called "fishing problem" (Cook &
Campbell, 1979). Besides avoiding Type I errors of capitalizing
on chance, one may avoid Type II errors that result from low
statistical power (Cohen, 1977) by abstaining from research
synthesis altogether when too few studies exist on the
particular topic (Cook & Leviton, 1980).
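
Both safeguards can be illustrated with a short Python sketch.
It assumes a chi-square homogeneity statistic of the familiar
weighted sum-of-squares form and a Bonferroni-style division of
the alpha level; these are common choices adopted here for
illustration, and they are not necessarily the exact procedures
of the authors cited above. All names and numbers are
hypothetical.

```python
def homogeneity_q(effects, variances):
    """Return (Q, df) for a set of effect sizes and their sampling variances.

    Q is the inverse-variance weighted sum of squared deviations from the
    weighted mean effect size. Under the hypothesis that sampling error
    alone explains the variation, Q is referred to a chi-square
    distribution with k - 1 degrees of freedom.
    """
    weights = [1.0 / v for v in variances]
    d_bar = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - d_bar) ** 2 for w, d in zip(weights, effects))
    return q, len(effects) - 1


def bonferroni_alpha(familywise_alpha, n_comparisons):
    """Per-comparison alpha that holds the familywise rate at the given level."""
    return familywise_alpha / n_comparisons


# Three hypothetical studies with equal sampling variances.
q, df = homogeneity_q([0.1, 0.5, 0.9], [0.04, 0.04, 0.04])

CHI_SQUARE_CRITICAL_DF2 = 5.991  # tabled .05 critical value for 2 df
heterogeneous = q > CHI_SQUARE_CRITICAL_DF2

per_test_alpha = bonferroni_alpha(0.05, 5)
```

When the statistic exceeds the tabled critical value, as it does
for these hypothetical studies, a single pooled effect size should
be distrusted and a search for moderator variables is warranted.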

An additional problem involves the unit of analysis in
research synthesis. Multiple outcomes from a single study
(e.g., multiple treatment or control groups, multiple dependent
measures, or measures taken at multiple points in time) must be
treated as being nonindependent. Thus one should either average
multiple outcomes within studies to compute an overall effect
size or compare multiple outcomes within studies to search for
moderator variables (Landman & Dawes, 1982).
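
The first of these options, averaging within studies before
pooling, might look like the following Python sketch. The study
labels and effect sizes are hypothetical, and the unweighted
average across studies is used only to keep the illustration
short.

```python
from collections import defaultdict
from statistics import mean


def study_level_effects(outcomes):
    """Collapse (study_id, effect_size) pairs to one mean effect per study."""
    by_study = defaultdict(list)
    for study_id, d in outcomes:
        by_study[study_id].append(d)
    return {study_id: mean(ds) for study_id, ds in by_study.items()}


# Hypothetical data: study A reports two outcomes, B one, and C three.
outcomes = [
    ("A", 0.2), ("A", 0.4),
    ("B", 0.5),
    ("C", 0.1), ("C", 0.3), ("C", 0.5),
]

per_study = study_level_effects(outcomes)

# Each study now contributes exactly one value to the overall estimate,
# so a study with many outcomes no longer counts more than once.
overall = mean(list(per_study.values()))
```

Averaging the six raw outcomes directly would instead give study C
three times the influence of study B.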

Construct validity. Construct validity concerns the degree
to which the particular treatments and outcome measures are valid
operationalizations of the constructs supposedly underlying the
independent and dependent variables (Cook & Campbell, 1979). To
improve construct validity in research synthesis, one should at
the outset explicitly specify the range of treatments, comparison
groups, and outcome measures that will be considered relevant
(Bryant & Wortman, 1984; Cooper, 1982). Furthermore, in
order to avoid lumping together "apples and oranges" (Gallo;
Wortman), one should consider different forms of
treatment separately (Light & Pillemer, 1984) and should divide
studies into clusters according to the measurement instruments
used (Feldman). Previous theory and research may be useful
in deciding how to stratify treatments and outcome measures.
For example, a recent synthesis of research on age differences in
subjective well-being (Stock, Okun, Haring, & Witter)
combined into one global index five related types of outcome
measures: life satisfaction, happiness, morale, quality of life,
and well-being. Recent theory and research on subjective mental
health (Bryant & Veroff; Veit & Ware, 1983), however,
suggest that these are clearly distinct constructs that should be
considered separately.

Perhaps the most serious threat to the construct validity of
research synthesis is the difficulty of assessing the strength or
"dosage" (Quay, 1977; Sechrest, West, Phillips, Redner, & Yeaton,
1979) of the treatment. Often one only finds significant main
effects or interactions when the appropriate levels of
independent variables have been implemented (Cooper & Arkin,
1981). For example, programs designed to promote preventive
health behaviors by arousing fear may only work when they elicit
moderate levels of fear and may be relatively ineffective when
they involve either low or high levels of fear (Janis & Feshbach).
This suggests that research synthesis should incorporate
qualitative information about the strength of the treatment as
implemented in the primary studies (Light & Pillemer, 1984;
Morra, 1983). This would enable us to distinguish studies
involving stronger treatments from studies involving milder
treatments and to specify the level at which a particular
treatment works best.

A related strategy for improving construct validity in
research synthesis is to decompose the "treatment package"
(Quay, 1977) into its composite constructs. This involves
identifying different conceptual components of a particular
treatment program and keeping track of the levels at which each
of these components has been implemented across studies. This
approach enables one to pinpoint the specific ingredients that
maximize a program's impact. Synthesizing research on hospital
patient education programs, for example, Devine and Cook (1983)
identified three common components of treatment interventions:
(1) providing patients with information about medical procedures,
(2) attempting to increase patients' feelings of control, and (3)
attempting to reduce patients' levels of anxiety. The programs
most effective in reducing length of stay were those that
incorporated all three of these components; programs that
involved only one or two of these components were less effective.
This illustrates how decomposing a treatment into its composite
constructs can help us specify precisely which of these
constructs are responsible for a program's effectiveness.

External validity. External validity concerns the degree of
confidence that one has in generalizing results to different
populations, settings, or time periods (Campbell & Stanley, 1966;
Cook & Campbell, 1979). External validity is especially
problematic in research synthesis because there is no single
definitive list of all existing studies on a given topic. This
typically precludes an exhaustive sampling of all existing
studies and prevents one from determining the representativeness
of one's final sample of studies (Feldman, 1971).

In discussing internal validity, I suggested that one may
decide to sample only studies of relatively higher methodological
quality when the range of design quality is restricted to the low
end of the continuum and one wishes to place more weight on
internal validity than on external validity. I will now propose
two other types of purposive sampling strategies that can be used
in research synthesis.

The first purposive sampling strategy is to sample for modal
instances (Cook & Campbell, 1979; St. Pierre & Cook, 1984).
This involves limiting the sample of studies to those using the
most widely representative populations, settings, or forms of
treatment implementation. This sampling plan provides program
developers with information that is generalizable solely to the
cases that are most typical. The task here is to define the
variables across which one wishes to generalize and then to
select instances at the mode of each of these variables (St.
Pierre & Cook, 1984). Imagine, for example, that one has
been commissioned to synthesize research on school desegregation
for the legislature of a particular New England state. In this
case, one might decide to sample only studies of two-way cross-
district busing programs in large, urban settings, if this was
the modal instance for the particular state.
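
Sampling for modal instances amounts to a simple filter once the
modal values have been specified. The Python sketch below is an
illustration only; the study records, the variable names, and the
modal values are hypothetical, and the modal values are assumed to
be supplied by the evaluator from knowledge of the target setting
rather than computed from the studies themselves.

```python
def sample_modal_instances(studies, target_modes):
    """Keep only studies that match the modal value on every variable.

    target_modes maps each variable of interest to the value judged most
    typical for the population to which one wishes to generalize.
    """
    return [
        study for study in studies
        if all(study.get(var) == value for var, value in target_modes.items())
    ]


# Hypothetical coded desegregation studies.
studies = [
    {"id": 1, "setting": "large urban", "plan": "two-way cross-district busing"},
    {"id": 2, "setting": "large urban", "plan": "magnet school"},
    {"id": 3, "setting": "rural", "plan": "two-way cross-district busing"},
    {"id": 4, "setting": "large urban", "plan": "two-way cross-district busing"},
]

# Modal instance assumed for the hypothetical state.
target_modes = {"setting": "large urban", "plan": "two-way cross-district busing"}

modal_sample = sample_modal_instances(studies, target_modes)
modal_ids = [study["id"] for study in modal_sample]
```

The resulting sample supports conclusions about the typical case
only; studies of atypical settings or plans are deliberately
excluded.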

Another type of purposive sampling strategy is to sample on
implementation (Cook & Campbell, 1979; St. Pierre & Cook, 1984).
This involves selecting only studies in which the particular
program is developed and mature enough to be well-implemented or
to be transferable from one location to another. For example, in
synthesizing research on alternative health care programs for
state-subsidized nursing homes, one might decide to include only
studies of programs that one feels have been developed clearly
and fully enough to be transferred to the particular sites one
has in mind. Alternatively, one may choose to sample only studies
in which the program has been either particularly well-
implemented, moderately well-implemented, or poorly implemented,
to determine how well it works at different levels of
implementation. This represents one way of incorporating crucial
qualitative information about the integrity or fidelity of the
treatment (Gottfredson; Quay, 1977; Sechrest et al., 1979).



As is the case with primary research, the ultimate test of
the external validity of research synthesis is independent
replication. Thus, in the long run, the best way to enhance
external validity in research synthesis may be to improve the
ability of others to duplicate (1) our procedures for selecting
relevant studies and relevant comparisons within studies and (2)
our criteria for coding quantitative and qualitative information
from primary studies. One strategy for improving replicability
is to make explicit the many subjective judgments that one must
necessarily make in research synthesis (Cooper, 1982; Wortman &
Bryant, 1985). Without knowing the specific criteria by which a
particular researcher has resolved these inevitable
uncertainties, independent replication remains impossible.

Another way to enhance the external validity of research
synthesis is to establish formal archives of published and
unpublished reports on selected topics. In fact, this is
currently being done in the field of education by the Educational
Resources Information Center (ERIC). This helps to promote
independent reanalyses by providing others with a comparable
sample of studies for research synthesis.

Perhaps the most efficient method of improving external
validity, however, would involve making public the actual raw
data from research syntheses. Just as archiving primary research
data facilitates more effective secondary analysis (Bowering,
1984; Bryant & Wortman, 1978), so may archiving the data from
research synthesis promote more valid reanalyses of the same
data base (Bryant & Wortman, 1984).

In conclusion, I have argued that we can improve the quality
of research synthesis by controlling for threats to its internal,
statistical conclusion, construct, and external validity. I have
considered major threats to each type of validity and have
suggested specific strategies for avoiding these pitfalls. There
are, however, other potential threats to validity in research
synthesis about which we know very little. For example, small
sample sizes reduce statistical power and undermine statistical
conclusion validity (Cook & Campbell, 1979); however, no formal
rules have been established for deciding on the minimum number of
studies required for research synthesis. Future work should
explore whether power curves (Cohen, 1977; Feldt & Mahmoud, 1958)
for determining the number of subjects to include in primary
research can be used to determine the number of studies to
include in research synthesis (Bryant & Wortman, 1984). In
addition, we know very little about how artifacts such as
sampling error influence estimates of effect size. Monte Carlo
"simulation" studies are clearly needed to test the
susceptibility of statistical procedures in research synthesis to
Type I and Type II errors. Only by carefully considering sources
of error and bias in research synthesis can we maximize its
ability to provide us with valid conclusions.
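
As a rough indication of what such a Monte Carlo study might
involve, the following Python sketch simulates many small primary
studies drawn from populations with a known effect size and
records how widely the estimated effect sizes scatter through
sampling error alone. Everything in it is an assumption made for
illustration: normally distributed scores with unit variance,
equal group sizes, a true effect of 0.5, and the function name
itself.

```python
import random
from math import sqrt
from statistics import mean, stdev, variance


def simulate_effect_sizes(delta, n_per_group, replications, seed=0):
    """Simulate estimated effect sizes from repeated two-group studies.

    Scores are drawn from normal populations with standard deviation 1,
    so delta is the true standardized difference between the groups.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        treatment = [rng.gauss(delta, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        # With equal group sizes the pooled variance is the simple average.
        pooled = sqrt((variance(treatment) + variance(control)) / 2.0)
        estimates.append((mean(treatment) - mean(control)) / pooled)
    return estimates


estimates = simulate_effect_sizes(delta=0.5, n_per_group=20, replications=2000)

average_estimate = mean(estimates)  # centers near the true effect size
spread = stdev(estimates)           # scatter due to sampling error alone
```

Varying the group size, the true effect, or the number of
simulated studies in such a sketch would show how often a given
synthesis procedure reaches a wrong conclusion under known
conditions.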